Supervised and unsupervised real-bogus classifiers for astronomical data

Zizheng Xu, Manish Reddy Vuyyuru, Yuhao Lu

Advanced Topics in Data Science

1. Motivation

Modern time-domain surveys monitor large swaths of the sky to look for interesting astronomical objects, including near-earth objects, unidentified planetesimals, and transients. Imperfect CCDs, optics reflections, atmospheric effects, and a host of other factors contribute to the presence of bogus detections. These contaminate real detections of transients, variable stars, and planetesimals, making it harder to spot genuinely interesting celestial objects.

Given that we expect to observe ~200k objects in a single exposure in one of the four apertures of the Pan-STARRS telescope, it is impossible to manually examine whether each detection is real or bogus. Visual examination sometimes helps, but it does not solve the problem either, due to human error at such data volumes and the cost in time. We need an automatic solution to distinguish bogus detections from real ones.

2. Executive Summary

We first present several supervised methods of distinguishing real and bogus detections. We demonstrate that these can be well trained to predict the match labels.

Supervised models

| Methods (Evaluated On A Single SMF File) | True Negative Rate | False Negative Rate |
| --- | --- | --- |
| Logistic - L1 (All Linear Features) | 0.811 | 0.098 |
| Logistic - L1 (All Features Up To Power 2, Incl. Interaction Terms) | 0.823 | 0.093 |
| Logistic - L2 (Chosen Features) | 0.837 | 0.100 |
| SVC (Chosen Features) | 0.762 | 0.089 |
| **Ensemble Learning** | | |
| AdaBoost (Logistic - L2) | 0.804 | 0.093 |
| Bagging (Logistic - L2) | 0.770 | 0.098 |
| Bagging (SVC) | 0.655 | 0.069 |
| **Evaluated On All SMF Files** | | |
| Bagging (Logistic - L2) | 0.741 | 0.107 |
| Dense Neural Network | 0.667 | 0.071 |

However, training supervised methods to chase the match labels misses the real question. The main issue for astronomers is whether there is a method to tell whether a detection is real, rather than predicting whether a detection will be matched by the catalog. Moreover, the catalog matching label is relatively easy to obtain, compared to a true real/bogus label.

Based on the idea that real objects might share common structure in the parameter space, we implemented several unsupervised methods that demonstrate a promising direction for future work. We measured the performance of the unsupervised methods using three quantities and tested their consistency over a train/test split and over all the SMF files we have.

We summarise the performance of the methods on real-bogus classification below:

| Methods (Evaluated On a Single SMF File) | Recall Rate | Bogus Rate | Edge-Bogus Rate |
| --- | --- | --- | --- |
| Baseline k-means | 0.53 | 0.51 | 0.02 |
| Initialized k-means | 0.90 | 0.30 | 0.65 |
| EM w. baseline prior | 0.80 | 0.24 | 0.09 |
| EM w. engineered prior | 0.79 | 0.23 | 0.05 |
| Variational Gaussian Mixture | 0.77 | 0.43 | 0.29 |
| **Evaluated On All SMF Files** | | | |
| Variational Gaussian Mixture | 0.76 | 0.33 | 0.21 |

3. Download required code and data

We directly use the smf files labelled with the stationary and moving objects from the catalog (labelled for us in Tutorial 3 Notebook). The files are hosted on Google Drive (see https://drive.google.com/open?id=12jYG_qgX6xp-X4iQnMStPJT9Av5oe-7l).

4. Imports

5. Definitions

5.1 EDA

We begin by looking at the differences in the distributions of each feature for detections tagged as 'MATCH' versus detections tagged as 'NO MATCH'. We also look at the differences in the distributions of each feature for detections tagged as an 'EDGE' versus detections tagged as a 'NO EDGE'.

We show below the differences in the distribution of PSF_CHISQ among 1. matched and not matched objects and 2. edge and non-edge objects.

We see a similar situation to the above in many of the other parameters. There is generally a difference in the distributions for 'MATCH' vs 'NO MATCH' and 'EDGE' vs 'NO EDGE'.

Some of the parameters are more naturally represented in 2D, so we investigate these accordingly. Consider some of these shown below.

We see that there is a tendency for 'MATCH' objects to lie within chip cells and 'NO MATCH' objects to lie on the edge of chip cells. We confirm this by comparing the plot on the left with the one on the right showing the edges on the chip cells (compare position of 'EDGE' vs. 'NO EDGE' objects).

We also see this clearly when considering the distributions of detections on the level of a RA-DEC plot. To see this more clearly, consider a zoom-in of the previous plot below.

Notice again the tendency for 'MATCH' objects to land within chip cells and for 'NO MATCH' objects to land on the edge of chip cells. We also investigate other parameters that are naturally represented in two dimensions, such as MOMENTS_XX and MOMENTS_YY or X_PSF_SIG and Y_PSF_SIG. Notice how these parameters show different structures in the parameter space for real versus bogus detections.

Unfortunately, while these features demonstrate some degree of separation between 'MATCH' and 'NO MATCH' objects, they are highly correlated. Consider the pairwise correlation between features below.

Notice how the features are highly correlated with each other.

To sum up, we show that 'MATCH' objects share certain properties in the feature space that differentiate them from 'NO MATCH' objects. We face the problem of high correlation between features, which could be a problem down the line when trying to interpret the results of supervised learning.

Either way, the results from the EDA seem sufficiently promising, and we will proceed with supervised methods for the task.

6. Supervised learning

We considered several supervised learning methods. We intend to cover the following methods:

  1. Linear Classifiers (Logistic Regression, Support Vector Classification)
  2. Ensembles (AdaBoost, Bagging)
  3. Neural Networks (Dense Networks)

We choose not to consider decision trees, as these were already experimented with on this dataset in the tutorial 2 notebook.

Proportion of 'MATCH' and 'NO MATCH' Objects

We demonstrate that these methods are able to reach a sufficient level of performance in predicting the catalog labels. First let us begin by noting one of the challenges with this dataset. Consider below the ratios of 'MATCH' and 'NO MATCH' objects on a chip by chip basis.

Notice that the majority of detections within a chip (about 73% on average) are 'MATCH' objects. This imbalance in the dataset could lead to performance issues with the supervised methods. There are several ways to tackle this, such as upsampling the 'NO MATCH' objects in the dataset to balance the ratios. For most of the supervised methods (all except neural networks), we choose to tackle this by adjusting a weight associated with each class such that the weights are inversely proportional to class frequencies.
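The report does not reproduce the weighting code itself, but a minimal sketch of the inverse-frequency scheme with scikit-learn (the toy data and the 73/27 split here are purely illustrative) might look like:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)
# Toy stand-in for one chip: ~73% MATCH (label 1) vs ~27% NO MATCH (label 0)
y = (rng.random(1000) < 0.73).astype(int)
X = rng.normal(size=(1000, 4)) + y[:, None]  # MATCH objects shifted in feature space

# Weights inversely proportional to class frequencies
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(weights)  # the rarer NO MATCH class receives the larger weight

# The same scheme baked into the estimator itself
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```

With `class_weight="balanced"`, each class contributes equally to the loss regardless of how many detections it has.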

Metrics & Random Classifier Baseline

To get a sense for the difficulty of the labelling task, we begin with a random classifier baseline. Given the expected ratio of 'MATCH' objects, we randomly assign labels, keeping the same proportion of 'MATCH' objects. To measure the performance on this task, we propose two metrics:

  1. True Negative Rate

Based on input from the module leaders, we propose a successful classifier as one that is able to retrieve 70-80% of the 'NO MATCH' objects. This corresponds to a true negative rate of 0.7-0.8.

  2. False Negative Rate

Based on input from the module leaders, we propose a successful classifier as one that is able to misclassify fewer than 10% of the 'MATCH' objects. This corresponds to a false negative rate of < 0.1.

Notice that the two metrics 'compete' with each other. By setting up the task in this manner, we prevent trivial nonsensical solutions (such as 100% recall just by predicting the entire chip as 'MATCH' objects, etc.).

We evaluate the performance of the random baseline classifier on these two metrics.
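As a hedged illustration of the two metrics, here is how a random baseline at the ~73% MATCH ratio can be scored (synthetic labels stand in for a real chip):

```python
import numpy as np

def tn_fn_rates(y_true, y_pred):
    """TNR: fraction of NO MATCH (0) correctly rejected;
    FNR: fraction of MATCH (1) wrongly rejected."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tnr = np.mean(y_pred[y_true == 0] == 0)
    fnr = np.mean(y_pred[y_true == 1] == 0)
    return tnr, fnr

rng = np.random.default_rng(1)
y_true = (rng.random(20000) < 0.73).astype(int)   # ~73% MATCH
y_rand = (rng.random(20000) < 0.73).astype(int)   # random labels, same proportion

tnr, fnr = tn_fn_rates(y_true, y_rand)
# Both rates land near 0.27: far below the 0.7 TNR target,
# and well above the 0.1 FNR target
print(f"TNR={tnr:.2f}  FNR={fnr:.2f}")
```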

Notice that the random classifier fails on the task.

  1. We can see visually that the 'NO MATCH' objects do not lie preferentially on the edges of the chip cells.
  2. The metrics for the classifier do not come close to the proposed ranges for a successful classifier.

Notice that the classifier is much closer to the false negative target than to the true negative target. This is likely because of the prevalence of 'MATCH' objects compared to 'NO MATCH' objects. Given the nature of the task, we assign more weight to the threshold for the false negative rate, as we believe matches to the catalog are very likely to be true.

Feature Selection

Given the large number of potential features in the dataset, we propose to use an automated feature selection method. We fit a logistic regression with L1 regularization on all the features to accomplish this.

From this, we retrieve the features that were determined to be the most important across the chips. Notice the similarity between our most important features and those determined by a random forest classifier in notebook 2, reproduced below:

  1. PSF_QF_PERFECT
  2. SNR
  3. CR_LIMIT
  4. PSF_THETA
  5. SKY
  6. X_PSF
  7. X_PSF_SIG
  8. Y_PSF
  9. Y_PSF_SIG
  10. EXT_LIMIT
  11. NONLINEAR_FIT
  12. PASS1_SRC

This gives us confidence that in this classification task, the features determined by our method are indeed significant. Additionally, we also consider interaction terms and terms up to power 2, again by employing logistic regression with L1 regularization.

From this, as before, we obtain the most important interaction and power 2 terms across the chips. We will choose to use the top 19 linear features and top 19 interaction and power 2 terms. Below, we double check the pairwise correlation between our selected features.

We see that while the selected features do not exhibit the same extreme degree of correlation, they are still quite correlated. We will proceed with these features as is. However, if we intend to do further studies deconstructing how a learned supervised classifier identifies a detection as real or bogus, it would be useful to revisit feature selection and select features that are not as correlated.

We will proceed with supervised methods trained on a chip by chip basis, unless otherwise stated. This is to reflect the fact that each chip behaves slightly differently. An alternative to this could have been to one-hot encode the chip ID.

Logistic Regression

We begin by using logistic regression with L2 regularization for the classification task. Logistic regression provides a probability that a data point belongs to either class; the classification is made by applying a set threshold to this probability.

Given a feature vector X, weight vector W, and bias B, the predicted probability is

$$\mathrm{sigmoid}(z) = \frac{1}{1 + e^{-(WX + B)}}, \qquad z = WX + B$$

We optimize the log likelihood via gradient descent, with an L1- or L2-regularized cost function. We also weight the cost function to reflect the prevalence of 'MATCH' and 'NO MATCH' objects in our data.

Notice that the simple logistic classifier performs fairly well on the task.

  1. We can see visually that most of the 'NO MATCH' objects line up nicely on edges of the chip cells.
  2. The true negative rate beats the threshold we are willing to accept, and the false negative rate sits right at our threshold.

It is quite surprising that a method as simple as the logistic classifier is able to achieve this performance. Additionally, we also evaluated a support vector classifier (SVC). We expect that the SVC might perform slightly better, as it seeks a decision boundary that maximizes the margin to points from either class.

Support Vector Classifier (SVC)

The support vector classifier determines a decision boundary in the feature space that best separates the data into two classes. We apply the standard kernel tricks. As before, we modify the cost function to reflect the prevalence of 'MATCH' and 'NO MATCH' objects in our data.
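A minimal class-weighted SVC sketch (the RBF kernel and the toy data here are our assumptions, not necessarily the exact configuration used in the report):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
y = (rng.random(1500) < 0.73).astype(int)
X = rng.normal(size=(1500, 4)) + 1.2 * y[:, None]

# 'balanced' scales the per-class penalty inversely to class frequency,
# mirroring the weighting used for the logistic classifier
svc = SVC(kernel="rbf", class_weight="balanced").fit(X, y)
pred = svc.predict(X)
tnr = np.mean(pred[y == 0] == 0)
print(f"training true negative rate: {tnr:.2f}")
```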

Again, we notice that the SVC classifier performs fairly well on the task.

  1. We see 'NO MATCH' objects falling on the edges of chip cells.
  2. The true negative rates and false negative rates across the chips beat our thresholds with comfortable margins.

The SVC performs better than the logistic classifier. It is hard to draw direct comparisons between the two methods, so we will not speculate on why one performs better than the other. A natural progression from these individual linear classifiers is to employ an ensemble of such classifiers to improve performance. We considered both AdaBoost and Bagging.

AdaBoost

AdaBoost employs a weighted ensemble of classifiers. The ensemble F is constructed iteratively: the weight of each data point is updated based on the performance of the previous classifiers $f_0, f_1, \ldots, f_{m-1}$ on that data point, and each classifier $f_m$ in turn receives a weight $\theta_m$. The ensemble classifier can be summarised as:

$$F(x) = \sum_m \theta_m f_m(x)$$

AdaBoost with Logistic Regressors performs fairly well on the task.

We see that there is not a significant change in the performance of the logistic classifier with AdaBoost. We trade a lower true negative rate for a lower false negative rate. SVC is not well suited for AdaBoost because its classification is based on distance to the nearest points between the two classes; there is no probability associated with the classification.

Bagging

We next considered the bagging ensemble method for logistic regression and SVC classifiers. In bagging, we combine an ensemble of models trained on bootstrapped samples of the training data.

Again, as was the case with AdaBoost, we see that employing a Bagging ensemble does not significantly affect the results. We see that compared to a standalone logistic regressor, the bagging classifier offers a slightly lower false negative rate for a lower true negative rate. We are willing to accept this tradeoff given our priorities for the true negative vs. false negative rates as described previously. This is our best classifier thus far. We also consider a bagging ensemble of SVC classifiers.

We see that a bagging ensemble of SVC classifiers, as was the case of ensembles of logistic regressor classifiers, does not significantly affect the performance. We still see that the classifier performs decently, trading off a lower false negative rate for a lower true negative rate.

We then ran the best classifier thus far, a bagging ensemble of logistic regressors, on the full dataset.

We see that we obtain acceptable, but not spectacular performance on the true negative and false negative rates across the chips. We see a fairly consistent performance on these metrics that beats our thresholds.

Neural Network Classifiers

Next, we consider a dense neural network for the classification task. Notably,

  1. Instead of adjusting the weights for the different classes in the loss function to account for the imbalance in the dataset, we upsample the 'NO MATCH' objects in the dataset to balance their relative prevalence.

  2. We train a single neural network to classify detections across all chips.
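The report trains a dense network; as a framework-agnostic sketch, scikit-learn's MLPClassifier stands in for a deep-learning framework here, together with the upsampling step described above (data synthetic):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.utils import resample

rng = np.random.default_rng(6)
y = (rng.random(2000) < 0.73).astype(int)
X = rng.normal(size=(2000, 4)) + y[:, None]

# Upsample the minority NO MATCH (0) class to match the MATCH (1) count
X0, X1 = X[y == 0], X[y == 1]
X0_up = resample(X0, n_samples=len(X1), replace=True, random_state=0)
Xb = np.vstack([X0_up, X1])
yb = np.concatenate([np.zeros(len(X1), int), np.ones(len(X1), int)])

# Small dense network: two hidden layers of 32 and 16 units
mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
mlp.fit(Xb, yb)
print(f"balanced training accuracy: {mlp.score(Xb, yb):.2f}")
```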

We see that the neural network's performance does not match the other supervised methods. While it exhibits good false negative rates, it overpredicts detections as 'MATCH' objects (indicated by the very low false negative rate and very low true negative rate).

7. Unsupervised Methods

We tried the following models:

From the EDA, we can see that the classes are reasonably separable in the feature space, and that a Gaussian approximation to their distributions is acceptable.

Baseline k-means

We chose the most significant features discovered in the supervised learning part and passed them into the k-means algorithm:

This model is quite successful on some chips, but since we cannot control the labels due to the random initialization of clusters, the cluster labels are randomly flipped.
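A baseline k-means sketch on synthetic two-cloud data illustrates both the clustering and the label-flipping issue:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
# Two well-separated toy clouds standing in for bogus / real detections
X = np.vstack([rng.normal(0.0, 1.0, (300, 3)),
               rng.normal(3.0, 1.0, (700, 3))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# k-means recovers the two groups, but whether the 'real' cloud is labelled
# 0 or 1 depends on initialization -- the ids must be re-aligned afterwards
frac_first = np.mean(km.labels_[:300] == km.labels_[0])
print(f"first cloud purity: {frac_first:.2f}")
```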

K-means with initialization

With the lesson learned from the baseline model, we proposed an improvement: the initial cluster assignment of points is determined by the MATCH label. We get the benefit of a good initial guess of the cluster centers, and at the same time, we expect real detections in the first cluster and bogus ones in the second.
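One way to realize this, assuming scikit-learn's KMeans: seed the two centers from the per-label mean feature vectors (synthetic data; here cluster 1 is the MATCH-seeded one, illustrative only):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(8)
match = (rng.random(1000) < 0.7).astype(int)          # catalog MATCH label
X = rng.normal(size=(1000, 3)) + 3.0 * match[:, None]

# Seed cluster 0 with the NO MATCH mean and cluster 1 with the MATCH mean,
# so the cluster ids carry a fixed real/bogus meaning by construction
centers = np.vstack([X[match == 0].mean(axis=0),
                     X[match == 1].mean(axis=0)])
km = KMeans(n_clusters=2, init=centers, n_init=1).fit(X)
recall = np.mean(km.labels_[match == 1] == 1)
print(f"MATCH recall: {recall:.2f}")
```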

The recall rate is good!

Here is some more explanation of the new metric, P(EDGE | MATCH, cluster=bogus):

This quantity captures the percentage of EDGE==1 detections among those clustered as bogus but that still match the catalog.

This phenomenon arises because the MATCH==1 labels are not perfect. There are more detections on cell edges, yet the distribution of stars in the sky is uniform over a small portion of the sky. Therefore, the excess of detections on cell edges is highly likely to be bogus. Meanwhile, because of the high density of detections on cell edges, there is a chance that bogus points match the catalog by coincidence and are therefore mistakenly labelled as MATCH. This metric shows that our clustering procedure can not only recover real detections from those thrown away by catalog matching, but also spot bogus detections that happen to MATCH.

The metric P(cluster=bogus) serves as a sanity check that we are not trivially predicting everything on the chip to be real.
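The three quantities can be computed directly from the label arrays; a small self-contained sketch (the tiny arrays are illustrative, not real chip data):

```python
import numpy as np

def cluster_metrics(match, edge, cluster):
    """match, edge: catalog labels (0/1); cluster: 1 = real, 0 = bogus."""
    recall = np.mean(cluster[match == 1] == 1)   # P(cluster=real | MATCH)
    bogus_rate = np.mean(cluster == 0)           # P(cluster=bogus)
    sel = (match == 1) & (cluster == 0)          # MATCHed but clustered as bogus
    edge_bogus = np.mean(edge[sel]) if sel.any() else np.nan
    return recall, bogus_rate, edge_bogus

match   = np.array([1, 1, 1, 1, 0, 0, 0, 1])
edge    = np.array([0, 0, 0, 1, 1, 1, 0, 1])
cluster = np.array([1, 1, 1, 0, 0, 0, 0, 0])
print(cluster_metrics(match, edge, cluster))  # (0.6, 0.625, 1.0)
```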

Expectation-Maximization algorithm

To go beyond linear decision boundaries, and to allow for different variances in the features, we adopted the EM algorithm, with initialization provided by k-means.

It seems that on some chips the labels flipped and the recall rate is not ideal. This is because we set a prior which is too wide, namely

# Initial precision matrices for both components, scaled down by 15 (a wide prior)
precisions_init = np.stack([np.diag(Xinit.values[0]),
                            np.diag(Xinit.values[1])], axis=0) / 15

We will try a stricter prior on the second cluster, which is initialized by MATCH, while keeping the wide prior on the first cluster unchanged.

# Stricter (larger) initial precision on the second, MATCH-initialized component
precisions_init = np.stack([np.diag(1 / Xinit.values[0] / 10),
                            np.diag(1 / Xinit.values[1] / 8)], axis=0)

We see that by enforcing a strict prior on the cluster that is meant to be real, the recall rate reaches 100% on some chips. This means that some chips have very good feature consistency among all MATCHed objects, while others do not. This EM algorithm is very promising for this semi-supervised learning problem, as long as we have highly confident MATCH labels.
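The `precisions_init` argument suggests scikit-learn's GaussianMixture; a self-contained toy version of the scheme, with synthetic data standing in for the chip features and a slightly tighter initial precision on the MATCH-seeded component, might look like:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(9)
match = (rng.random(1000) < 0.7).astype(int)          # catalog MATCH label
X = rng.normal(size=(1000, 3)) + 3.0 * match[:, None]

# Initialize means from the two label groups; component 1 is MATCH-seeded
means = np.vstack([X[match == 0].mean(axis=0), X[match == 1].mean(axis=0)])
var = np.vstack([X[match == 0].var(axis=0), X[match == 1].var(axis=0)])

# Diagonal initial precisions; /8 on component 1 is tighter than /10
prec = np.stack([np.diag(1 / var[0] / 10), np.diag(1 / var[1] / 8)], axis=0)

gm = GaussianMixture(n_components=2, covariance_type="full",
                     means_init=means, precisions_init=prec, random_state=0)
labels = gm.fit_predict(X)
recall = np.mean(labels[match == 1] == 1)
print(f"MATCH recall: {recall:.2f}")
```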

Variational Gaussian mixture models

The structure of the variational Gaussian mixture model is as follows:

Bayesian Gaussian mixture model using plate notation. Smaller squares indicate fixed parameters; larger circles indicate random variables. Filled-in shapes indicate known values. The indication [K] means a vector of size K. (Image credit: Wikipedia.)

This model treats all parameters as random variables.

First, we have β to represent the probability of an observation being real or bogus.

Second, the centers and variances of the clusters, (μk, σk), are generated by Gaussian distributions characterized by the parameters μ, ν, λ, σ0².

Given the cluster centers, each observation is drawn from its corresponding cluster (μk, σk).

This model tells a more comprehensive story of how the data we collect is generated. Since we are interested in telling whether a detection is real or bogus, this model is well suited to our task.
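A minimal sketch with scikit-learn's BayesianGaussianMixture, which fits exactly this kind of model by variational inference (synthetic data; aligning cluster ids via the larger mixture weight is our assumption):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(10)
match = (rng.random(1000) < 0.7).astype(int)
X = rng.normal(size=(1000, 3)) + 3.0 * match[:, None]

# Variational inference places priors on the mixture weights (beta),
# component means, and precisions, rather than point-estimating them
bgm = BayesianGaussianMixture(n_components=2, covariance_type="full",
                              max_iter=500, random_state=0)
labels = bgm.fit_predict(X)

# Align cluster ids: call the component with the larger weight 'real'
real_id = int(np.argmax(bgm.weights_))
recall = np.mean(labels[match == 1] == real_id)
print(f"MATCH recall: {recall:.2f}")
```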

This model outperforms every model we tried previously, as we can read from the histograms of the metrics: it achieves reasonable MATCH recall and high edge-bogus rates, but relatively high rejection rates.

Metrics for Unsupervised Methods

We measured the recall rate of 'MATCH', P(EDGE | MATCH=1, cluster=bogus), and P(cluster=bogus) in the cells, and the results turn out very good. As a proof of the stability and self-consistency of the algorithm, we made a train-test split on each chip. We cluster on the training set and use the trained decision boundary to predict on the test set. For each individual chip, we collect the three pairs of measures and perform a paired t-test over the 60 measurements. We repeat the process for each SMF file and collect the p-values of those t-tests.
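The consistency check can be sketched with `scipy.stats.ttest_rel`; here the 60 per-chip train/test metric pairs are synthetic and constructed with no systematic shift:

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(11)
# Toy stand-in: one metric (e.g. MATCH recall) measured on the train and test
# halves of each of 60 chips; the test values scatter around the train values
train_metric = 0.76 + rng.normal(0.0, 0.03, 60)
test_metric = train_metric + np.tile([0.01, -0.01], 30)  # zero mean difference

t, p = ttest_rel(train_metric, test_metric)
# A p-value far above 0.05 gives no evidence of a systematic train/test
# shift, i.e. the clustering is self-consistent across the split
print(f"paired t-test p-value: {p:.3f}")
```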

We see that the distribution of p-values lies well above the 0.05 significance level; therefore, we can say that the variational Gaussian mixture model yields quite self-consistent results.

Summary for unsupervised models

Starting from the assumption that real and bogus detections exhibit distinguishable patterns in feature space, we built unsupervised models in an endeavor to figure out which detections are real and which are bogus. Our models include:

Using any of the models above, we can expect two favorable outcomes:

  1. Pick up real detections from non-matches: unidentified planetesimals and moving near-earth objects are usually hard to find in catalogues, but we definitely do not want them to slip away due to misclassification. Using these models, we can pick out objects of interest from our new labels and keep track of them over time.

  2. Spot bogus detections in MATCH: even if there is a positional match with the catalogue, we still cannot guarantee that the object is real, because bogus detections have a chance of appearing right at the place of a catalogued object.

8. Discussion

In order to truly understand how well our unsupervised methods perform at separating bogus detections from real ones, we need authentic labels that indicate whether each detection is real. Sadly, such information is not easy to get and is expected to take more time than astronomers can afford. Therefore, although the above unsupervised results are self-consistent and perform well, we still worry about how they compare to ground truth.

To collect ground truth data, possible practices involve manual examination of each detection, labelling real/bogus carefully according to not only position but also luminosity, shape, parallax, and other parameters that could help distinguish bogus detections from real ones. We expect the real labels to come out some day.